
Spark Streaming Source Code Walkthrough: State Management with updateStateByKey and mapWithState

Editor's note: this post, compiled by the Programming Notes editors, walks through the Spark Streaming source code behind state management with updateStateByKey and mapWithState. We hope it serves as a useful reference.


Background:
Spark Streaming divides work into jobs by batch duration. Often, however, we need to compute over the past day or even week of data, which inevitably requires state management. Each batch duration produces a job, and each job operates on RDDs, so the question becomes: how do we maintain state across batches? This is where updateStateByKey and mapWithState come in.
Source code analysis:
1. Neither updateStateByKey nor mapWithState is defined on DStream itself; both become available through an implicit conversion to PairDStreamFunctions.

object DStream {

  // `toPairDStreamFunctions` was in SparkContext before 1.3 and users had to
  // `import StreamingContext._` to enable it. Now we move it here to make the compiler find
  // it automatically. However, we still keep the old function in StreamingContext for backward
  // compatibility and forward to the following function directly.
  implicit def toPairDStreamFunctions[K, V](stream: DStream[(K, V)])
      (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null):
    PairDStreamFunctions[K, V] = {
    new PairDStreamFunctions[K, V](stream)
  }
}
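To make the mechanism concrete, here is a minimal sketch (ssc is an assumed StreamingContext created elsewhere) showing that a key-value DStream picks up the PairDStreamFunctions methods without any extra import:

// Minimal sketch; assumed name: ssc (a StreamingContext).
// Because of the implicit toPairDStreamFunctions above, a DStream of pairs gains
// all PairDStreamFunctions methods without `import StreamingContext._`
// (that import was only needed before Spark 1.3).
val lines = ssc.socketTextStream("localhost", 9999)
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// What the compiler effectively inserts for us:
//   new PairDStreamFunctions(pairs).reduceByKey(_ + _)
val counts = pairs.reduceByKey(_ + _)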

updateStateByKey:
1. In PairDStreamFunctions, updateStateByKey is implemented as follows: updateFunc updates each key's state on top of the accumulated history, and the method returns a DStream of (key, state) pairs.

/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]
  ): DStream[(K, S)] = ssc.withScope {
  // defaultPartitioner
  updateStateByKey(updateFunc, defaultPartitioner())
}
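As a usage reference, a minimal running word count with this simple form might look like the sketch below (ssc and pairs are the assumed names from the sketch above; note that stateful transformations require a checkpoint directory):

// Minimal sketch of the (Seq[V], Option[S]) => Option[S] form.
ssc.checkpoint("/tmp/streaming-checkpoint")  // required for stateful transformations

// Returning None here would drop the key's state entirely.
val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
  (newValues, runningCount) => Some(newValues.sum + runningCount.getOrElse(0))

val runningCounts = pairs.updateStateByKey[Int](updateFunc)
runningCounts.print()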

2. defaultPartitioner:

private[streaming] def defaultPartitioner(numPartitions: Int = self.ssc.sc.defaultParallelism) =
new HashPartitioner(numPartitions)
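If the default parallelism is not what you want, the partitioning of the state RDDs can be controlled explicitly; a hedged sketch, reusing pairs and updateFunc from the sketches above:

import org.apache.spark.HashPartitioner

// Fix the number of partitions via the numPartitions overload ...
val counts1 = pairs.updateStateByKey[Int](updateFunc, 32)

// ... or supply a Partitioner directly, as in the overload analyzed below.
val counts2 = pairs.updateStateByKey[Int](updateFunc, new HashPartitioner(32))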

3. The partitioner controls how each RDD in the resulting state DStream is partitioned.

/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of the key.
* org.apache.spark.Partitioner is used to control the partitioning of each RDD.
* @param updateFunc State update function. If `this` function returns None, then
* corresponding state key-value pair will be eliminated.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true)
}
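The per-key function is thus wrapped into a partition-iterator form before being passed on. As a hedged sketch, an equivalent iterator-based update function written by hand (reusing the assumed pairs stream) would be:

import org.apache.spark.HashPartitioner

// Sketch: the iterator form sees one partition's worth of
// (key, values arriving in this batch, previous state) tuples at a time.
val iterUpdateFunc = (it: Iterator[(String, Seq[Int], Option[Int])]) =>
  it.map { case (word, newValues, prevCount) =>
    (word, newValues.sum + prevCount.getOrElse(0))
  }

// This is the overload that the simple form ultimately delegates to.
val counts = pairs.updateStateByKey[Int](
  iterUpdateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)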

4. rememberPartitioner is passed as true by default:

/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* org.apache.spark.Partitioner is used to control the partitioning of each RDD.
* @param updateFunc State update function. Note, that this function may generate a different
* tuple with a different key than the input key. Therefore keys may be removed
* or added in this way. It is up to the developer to decide whether to
* remember the partitioner despite the key being changed.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream
* @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean
  ): DStream[(K, S)] = ssc.withScope {
  new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, None)
}

5. StateDStream persists its RDDs with StorageLevel.MEMORY_ONLY_SER, i.e. serialized in memory, to keep the footprint of the potentially very large state data down:

class StateDStream[K: ClassTag, V: ClassTag, S: ClassTag](
    parent: DStream[(K, V)],
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    preservePartitioning: Boolean,
    initialRDD: Option[RDD[(K, S)]]
  ) extends DStream[(K, S)](parent.ssc) {

  super.persist(StorageLevel.MEMORY_ONLY_SER)
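If even serialized in-memory storage is too costly, the returned state DStream can be persisted at a different level; a hedged sketch (runningCounts is the assumed name from the earlier sketch):

import org.apache.spark.storage.StorageLevel

// Sketch: allow the state DStream to spill to disk when it no longer fits in memory.
runningCounts.persist(StorageLevel.MEMORY_AND_DISK_SER)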

6. computeUsingPreviousRDD, called when a previous state RDD exists, looks like this:

private[this] def computeUsingPreviousRDD(
    parentRDD: RDD[(K, V)], prevStateRDD: RDD[(K, S)]) = {
  // Define the function for the mapPartition operation on cogrouped RDD;
  // first map the cogrouped tuple to tuples of required type,
  // and then apply the update function
  val updateFuncLocal = updateFunc
  val finalFunc = (iterator: Iterator[(K, (Iterable[V], Iterable[S]))]) => {
    val i = iterator.map(t => {
      val itr = t._2._2.iterator
      val headOption = if (itr.hasNext) Some(itr.next()) else None
      (t._1, t._2._1.toSeq, headOption)
    })
    updateFuncLocal(i)
  }

  // The cogroup below touches every partition of prevStateRDD on every batch,
  // i.e. the entire accumulated state is scanned each time.
  val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner)
  val stateRDD = cogroupedRDD.mapPartitions(finalFunc, preservePartitioning)
  Some(stateRDD)
}
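To make that cost concrete, here is a small, hedged sketch on plain RDDs (the data and names are made up) of what each batch effectively does: the batch's new records are cogrouped with the entire previous state, even for keys that received no new data:

// Sketch of the per-batch cost of updateStateByKey, on plain RDDs (assumed name: sc).
val prevState = sc.parallelize(Seq(("a", 10), ("b", 5), ("c", 7)))  // all accumulated state
val batchData = sc.parallelize(Seq(("a", 1), ("a", 1)))             // this batch only

// cogroup walks every key of prevState, including "b" and "c" which got no new data.
val newState = batchData.cogroup(prevState).mapValues {
  case (newValues, oldState) => newValues.sum + oldState.headOption.getOrElse(0)
}
// newState contains ("a", 12), ("b", 5), ("c", 7)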

As a result, updateStateByKey is not recommended when the accumulated state is large: every batch cogroups against the entire previous state RDD.
[Diagram: updateStateByKey implementation / call chain]

mapWithState:
1. mapWithState returns a MapWithStateDStream; state is maintained and updated per key, with a user-supplied mapping function applied to each key-value record.

/**
 * :: Experimental ::
 * Return a [[MapWithStateDStream]] by applying a function to every key-value element of
 * `this` stream, while maintaining some state data for each unique key. The mapping function
 * and other specification (e.g. partitioners, timeouts, initial state data, etc.) of this
 * transformation can be specified using [[StateSpec]] class. The state data is accessible
 * as a parameter of type [[State]] in the mapping function.
 *
 * Example of using `mapWithState`:
 *
 *    // A mapping function that maintains an integer state and return a String
 *    // (the state here can be thought of as a table recording all the history kept per key)
 *    def mappingFunction(key: String, value: Option[Int], state: State[Int]): Option[String] = {
 *      // Use state.exists(), state.get(), state.update() and state.remove()
 *      // to manage state, and return the necessary string
 *    }
 *
 *    val spec = StateSpec.function(mappingFunction).numPartitions(10)
 *
 *    val mapWithStateDStream = keyValueDStream.mapWithState[StateType, MappedType](spec)
 *
 * @param spec Specification of this transformation
 * @tparam StateType Class type of the state data
 * @tparam MappedType Class type of the mapped data
 */
@Experimental
def mapWithState[StateType: ClassTag, MappedType: ClassTag](
    spec: StateSpec[K, V, StateType, MappedType]
  ): MapWithStateDStream[K, V, StateType, MappedType] = {
  new MapWithStateDStreamImpl[K, V, StateType, MappedType](
    self,
    // StateSpecImpl wraps the user-supplied StateSpec.
    spec.asInstanceOf[StateSpecImpl[K, V, StateType, MappedType]]
  )
}
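Putting this together, a minimal, hedged running word count with mapWithState might look like the following (ssc and pairs are the assumed names from the earlier sketches; the timeout value is arbitrary):

import org.apache.spark.streaming.{Durations, State, StateSpec}

ssc.checkpoint("/tmp/streaming-checkpoint")

// The simple (key, value, state) form: return the mapped record, update state in place.
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)   // write the running count back into the per-key state "table"
  (word, sum)         // the mapped record emitted for this input element
}

val spec = StateSpec.function(mappingFunc)
  .numPartitions(10)
  .timeout(Durations.minutes(30))  // keys idle longer than 30 minutes are timed out

val wordCountState = pairs.mapWithState(spec)
wordCountState.print()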

2. The MapWithStateDStream source is as follows:

/**
* :: Experimental ::
* DStream representing the stream of data generated by `mapWithState` operation on a
* [[org.apache.spark.streaming.dstream.PairDStreamFunctions pair DStream]].
* Additionally, it also gives access to the stream of state snapshots, that is, the state data of
* all keys after a batch has updated them.
*
* @tparam KeyType Class of the key
* @tparam ValueType Class of the value
* @tparam StateType Class of the state data
* @tparam MappedType Class of the mapped data
*/
@Experimental
sealed abstract class MapWithStateDStream[KeyType, ValueType, StateType, MappedType: ClassTag](
    ssc: StreamingContext) extends DStream[MappedType](ssc) {

  /** Return a pair DStream where each RDD is the snapshot of the state of all the keys. */
  def stateSnapshots(): DStream[(KeyType, StateType)]
}

/** Internal implementation of the [[MapWithStateDStream]] */
private[streaming] class MapWithStateDStreamImpl[
    KeyType: ClassTag, ValueType: ClassTag, StateType: ClassTag, MappedType: ClassTag](
    dataStream: DStream[(KeyType, ValueType)],
    spec: StateSpecImpl[KeyType, ValueType, StateType, MappedType])
  extends MapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream.context) {

  private val internalStream =
    new InternalMapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream, spec)

  override def slideDuration: Duration = internalStream.slideDuration

  override def dependencies: List[DStream[_]] = List(internalStream)

  // The actual computation is delegated to InternalMapWithStateDStream.
  override def compute(validTime: Time): Option[RDD[MappedType]] = {
    internalStream.getOrCompute(validTime).map { _.flatMap[MappedType] { _.mappedData } }
  }
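stateSnapshots() is the notable extra here: it exposes, per batch, the complete (key, state) table rather than only the records produced by the mapping function. A short hedged sketch (wordCountState is the assumed name from the earlier sketch):

// Mapped output: one record per input element processed (and per timed-out key).
wordCountState.print()

// Full snapshot of the state of ALL keys after each batch has updated them.
val snapshots = wordCountState.stateSnapshots()
snapshots.print()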

3. InternalMapWithStateDStream is the DStream that actually maintains and updates the historical state:

/**
* A DStream that allows per-key state to be maintained, and arbitrary records to be generated
* based on updates to the state. This is the main DStream that implements the `mapWithState`
* operation on DStreams.
*
* @param parent (key, value) stream that is the source
* @param spec Specifications of the mapWithState operation
* @tparam K Key type
* @tparam V Value type
* @tparam S Type of the state maintained
* @tparam E Type of the mapped data
*/
private[streaming]
class InternalMapWithStateDStream[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
    parent: DStream[(K, V)], spec: StateSpecImpl[K, V, S, E])
  extends DStream[MapWithStateRDDRecord[K, S, E]](parent.context) {

  // The in-memory state structure is updated batch after batch, so keep it in memory.
  persist(StorageLevel.MEMORY_ONLY)

4. InternalMapWithStateDStream.compute:

/** Method that generates a RDD for the given time */
override def compute(validTime: Time): Option[RDD[MapWithStateRDDRecord[K, S, E]]] = {
  // Get the previous state or create a new empty state RDD
  val prevStateRDD = getOrCompute(validTime - slideDuration) match {
    case Some(rdd) =>
      if (rdd.partitioner != Some(partitioner)) {
        // If the RDD is not partitioned the right way, let us repartition it using the
        // partition index as the key. This is to ensure that state RDD is always partitioned
        // before creating another state RDD using it
        MapWithStateRDD.createFromRDD[K, V, S, E](
          rdd.flatMap { _.stateMap.getAll() }, partitioner, validTime)
      } else {
        rdd
      }
    case None =>
      MapWithStateRDD.createFromPairRDD[K, V, S, E](
        spec.getInitialStateRDD().getOrElse(new EmptyRDD[(K, S)](ssc.sparkContext)),
        partitioner,
        validTime
      )
  }

  // Build this batch interval's data RDD.
  // Compute the new state RDD with previous state RDD and partitioned data RDD
  // Even if there is no data RDD, use an empty one to create a new state RDD
  val dataRDD = parent.getOrCompute(validTime).getOrElse {
    context.sparkContext.emptyRDD[(K, V)]
  }
  val partitionedDataRDD = dataRDD.partitionBy(partitioner)
  val timeoutThresholdTime = spec.getTimeoutInterval().map { interval =>
    (validTime - interval).milliseconds
  }

  Some(new MapWithStateRDD(
    prevStateRDD, partitionedDataRDD, mappingFunction, validTime, timeoutThresholdTime))
}

5. MapWithStateRDD is an RDD that holds both the data involved in the mapWithState operation and the logic for updating it; each of its partitions is represented by a single MapWithStateRDDRecord.

/**
* RDD storing the keyed states of `mapWithState` operation and corresponding mapped data.
* Each partition of this RDD has a single record of type [[MapWithStateRDDRecord]]. This contains a
* [[StateMap]] (containing the keyed-states) and the sequence of records returned by the mapping
* function of `mapWithState`.
* @param prevStateRDD The previous MapWithStateRDD on whose StateMap data `this` RDD
* will be created
* @param partitionedDataRDD The partitioned data RDD which is used update the previous StateMaps
* in the `prevStateRDD` to create `this` RDD
* @param mappingFunction The function that will be used to update state and return new data
* @param batchTime The time of the batch to which this RDD belongs to. Use to update
* @param timeoutThresholdTime The time to indicate which keys are timeout
*/
private[streaming] class MapWithStateRDD[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
    private var prevStateRDD: RDD[MapWithStateRDDRecord[K, S, E]],
    private var partitionedDataRDD: RDD[(K, V)],
    mappingFunction: (Time, K, Option[V], State[S]) => Option[E],
    batchTime: Time,
    timeoutThresholdTime: Option[Long]
  ) extends RDD[MapWithStateRDDRecord[K, S, E]](
    partitionedDataRDD.sparkContext,
    List(
      new OneToOneDependency[MapWithStateRDDRecord[K, S, E]](prevStateRDD),
      new OneToOneDependency(partitionedDataRDD))
  ) {

  @volatile private var doFullScan = false

  require(prevStateRDD.partitioner.nonEmpty)
  require(partitionedDataRDD.partitioner == prevStateRDD.partitioner)

  override val partitioner = prevStateRDD.partitioner

  override def checkpoint(): Unit = {
    super.checkpoint()
    doFullScan = true
  }

  override def compute(
      partition: Partition, context: TaskContext): Iterator[MapWithStateRDDRecord[K, S, E]] = {

    val stateRDDPartition = partition.asInstanceOf[MapWithStateRDDPartition]
    val prevStateRDDIterator = prevStateRDD.iterator(
      stateRDDPartition.previousSessionRDDPartition, context)
    val dataIterator = partitionedDataRDD.iterator(
      stateRDDPartition.partitionedDataRDDPartition, context)

    val prevRecord = if (prevStateRDDIterator.hasNext) Some(prevStateRDDIterator.next()) else None
    val newRecord = MapWithStateRDDRecord.updateRecordWithData(
      prevRecord,
      dataIterator,
      mappingFunction,
      batchTime,
      timeoutThresholdTime,
      removeTimedoutData = doFullScan // remove timedout data only when full scan is enabled
    )
    Iterator(newRecord)
  }
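The essential idea is that each partition of this RDD is exactly one record pairing the per-key state table with the batch's mapped output. A simplified, hedged model (not the actual Spark classes, which are private[streaming] and richer) might be:

// Simplified model only; the real types are MapWithStateRDDRecord and StateMap.
case class SimpleStateRecord[K, S, E](
    stateMap: Map[K, (S, Long)],  // key -> (state, last-updated time in ms)
    mappedData: Seq[E])           // records emitted by the mapping function this batch

// Because there is one record per partition, compute() only needs the previous
// record of the same partition plus this batch's data for that partition,
// instead of cogrouping against the full history as updateStateByKey does.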

6. updateRecordWithData: the RDD itself remains immutable, yet it can process changing data, because the mutation happens inside the record's state map.

def updateRecordWithData[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
    prevRecord: Option[MapWithStateRDDRecord[K, S, E]],
    dataIterator: Iterator[(K, V)],
    mappingFunction: (Time, K, Option[V], State[S]) => Option[E],
    batchTime: Time,
    timeoutThresholdTime: Option[Long],
    removeTimedoutData: Boolean
  ): MapWithStateRDDRecord[K, S, E] = {
  // Create a new state map by cloning the previous one (if it exists) or by creating an empty one
  val newStateMap = prevRecord.map { _.stateMap.copy() }.getOrElse { new EmptyStateMap[K, S]() }

  val mappedData = new ArrayBuffer[E]
  val wrappedState = new StateImpl[S]()

  // Call the mapping function on each record in the data iterator, and accordingly
  // update the states touched, and collect the data returned by the mapping function
  dataIterator.foreach { case (key, value) =>
    wrappedState.wrap(newStateMap.get(key))
    val returned = mappingFunction(batchTime, key, Some(value), wrappedState)
    if (wrappedState.isRemoved) {
      newStateMap.remove(key)
    } else if (wrappedState.isUpdated || timeoutThresholdTime.isDefined) {
      // Every record of the current batch is run through the user-defined function,
      // and the result is written back into newStateMap, which holds the accumulated
      // historical state.
      newStateMap.put(key, wrappedState.get(), batchTime.milliseconds)
    }
    mappedData ++= returned
  }

  // Get the timed out state records, call the mapping function on each and collect the
  // data returned
  if (removeTimedoutData && timeoutThresholdTime.isDefined) {
    newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>
      wrappedState.wrapTimingOutState(state)
      val returned = mappingFunction(batchTime, key, None, wrappedState)
      mappedData ++= returned
      newStateMap.remove(key)
    }
  }

  // From the RDD's point of view the partition represented by MapWithStateRDDRecord
  // has not changed; only the data held inside the record has changed.
  MapWithStateRDDRecord(newStateMap, mappedData)
}
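Note how timed-out keys are fed back through the mapping function with value None and a timing-out state. A hedged sketch of a mapping function in the four-argument (Time, key, value, state) form shown above, handling that case explicitly:

import org.apache.spark.streaming.{Durations, State, StateSpec, Time}

// Sketch: distinguish normal updates from timeout callbacks.
def mappingFunction(
    batchTime: Time, word: String, one: Option[Int], state: State[Int]): Option[(String, Int)] = {
  if (state.isTimingOut()) {
    // Invoked with one = None for keys idle past the configured timeout; the key's
    // state is removed right after this call, so emit a final record and do not update().
    Some((word, state.get()))
  } else {
    val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
    state.update(sum)
    Some((word, sum))
  }
}

// StateSpec.function also accepts this four-argument form directly.
val specWithTimeout = StateSpec.function(mappingFunction _).timeout(Durations.minutes(30))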

[Diagram: mapWithState implementation / call chain]

Summary:
updateStateByKey keeps state by cogrouping each batch with the entire previous state RDD, so every batch pays a cost proportional to all keys ever seen, and the result stream always carries the full (key, state) pairs. mapWithState instead stores a per-partition StateMap inside a MapWithStateRDDRecord, applies the mapping function only to the keys appearing in the current batch (plus timed-out keys), supports timeouts and initial state through StateSpec, and exposes both the mapped records and a full stateSnapshots() view. For large, long-lived state, mapWithState is therefore the more scalable choice.

